the nbro's blog

Optionals in Java

Mon, 29 Apr 2024 00:00:00 +0000

In Java, optionals (introduced in Java 1.8) are not strictly needed. In fact, you can write code that uses optionals that is as verbose or even more verbose than without them.

Without optionals

Integer i = null;

// Just do a usual null check 
// before calling a method or property on the object
if(i != null) {
    System.out.println(i.toString());
} else {
    i = 1;
    System.out.println(i.toString());
}

With optionals

Optional<Integer> o = Optional.ofNullable(null);

// This is similar to a null check
if(o.isPresent()) {
    System.out.println(o.get().toString());
} else {
    o = Optional.ofNullable(1);
    System.out.println(o.get().toString());
}

In my view, optionals were not optimally implemented in Java, both because of this and because there are two factory methods, of and ofNullable, to create them. I don’t see why we need of, which throws a NullPointerException if its argument is null. If we choose to use an optional, we already expect that the value may be absent, so ofNullable, which returns an empty optional when its argument is null, should be enough!

So, given these bad design decisions, why would people even use optionals?

I see only 2 main advantages of optionals

We can use the method orElse to get default values (instead of using isPresent combined with get). The example above is equivalent to
```
 Optional<Integer> opt = Optional.ofNullable(null);
 System.out.println(opt.orElse(1).toString());
```
They encourage you to think of the absence of values.
- However, null pointer exceptions were already infamous in Java and people should already be forced to do null checks.
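As a minimal sketch of the points above, the snippet below chains map and orElse instead of isPresent and get, and checks that Optional.of rejects null while Optional.ofNullable does not. The class name OptionalSketch and the helpers describe and ofThrowsOnNull are hypothetical names chosen only for this illustration.

```java
import java.util.Optional;

public class OptionalSketch {
    // Returns the string form of the value, or "1" (the default used in
    // the examples above) when the value is absent.
    static String describe(Integer value) {
        return Optional.ofNullable(value)
                .map(Object::toString)
                .orElse("1");
    }

    // Optional.of fails eagerly on null; Optional.ofNullable returns an empty optional.
    static boolean ofThrowsOnNull() {
        try {
            Optional.of((Integer) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe(42));
        System.out.println(ofThrowsOnNull());
    }
}
```

Whether this reads better than a plain null check is, as argued here, largely a matter of taste.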

There are other methods in Optional that people may occasionally find useful, but in many cases maybe not. So, I think it’s more a matter of taste or convention whether you use optionals or not. In other words, Java optionals may be optional ¹. I may change my opinion the more I use Java and optionals, so I may update this post accordingly.

In Rust, I like optionals because Rust programs are written with optionals everywhere and they seem natural. ↩

MDPs are Markov Games

Sun, 17 Mar 2024 00:00:00 +0000

MDPs

A Markov Decision Process (MDP) is a mathematical framework that can be used to model (sequential) stochastic decision-making problems, which are solved by a single (Reinforcement Learning) agent. Formally, an MDP can be defined as a tuple

\[\text{MDP} = (\mathcal{S}, \mathcal{A}, T, r, \gamma),\]

where

\(\mathcal{S}\) is the state space
\(\mathcal{A}\) is the action space
\(T = p(s' \mid s, a)\) is the transition function
\(r(s, a)\) is the reward function
\(\gamma\) is the discount factor

Markov Games

A Markov Game (MG) (also called Stochastic Game) is a mathematical formalism to describe problems where we have multiple agents (or players) that interact (cooperate, compete or both) with each other, which is the type of problem studied in Game Theory. Formally, an MG with \(n\) agents can be defined as a tuple

\[\text{MG} = (\mathcal{S}, \mathbb{A}, \Gamma, \mathbb{r}, \gamma),\]

where

\(\mathcal{S}\) is the state space
\(\mathbb{A} = \{\mathcal{A}_1, \dots, \mathcal{A}_n\}\) is a set of action spaces, one for each of the \(n\) agents
\(\Gamma = p(s' \mid s, a_1, \dots, a_n)\) is the transition function
\(\mathbb{r} = \{ r_1(s, a_1, \dots, a_n), \dots, r_n(s, a_1, \dots, a_n) \}\) is a set of reward functions, one for each agent
\(\gamma\) is the discount factor

MDPs are Markov Games

So, what are the differences between an MDP and an MG?

In an MG, we have a set of action spaces, one for each agent, while in an MDP we have only one action space,
In an MG, the transition function conditions on the action of all players, and
In an MG, we have a set of reward functions, one for each agent, where each reward function also conditions on the action of all players.

An MDP can therefore be viewed as a Markov Game with one player.
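
To make the reduction explicit, we can set \(n = 1\) in the definition of the MG above. This is only a restatement of the two tuples, not an additional result:

\[\begin{align*} \mathbb{A} &= \{\mathcal{A}_1\}, \\ \Gamma &= p(s' \mid s, a_1), \\ \mathbb{r} &= \{ r_1(s, a_1) \}. \end{align*}\]

Identifying \(\mathcal{A}_1\) with \(\mathcal{A}\), \(\Gamma\) with \(T\) and \(r_1\) with \(r\) gives back the tuple \((\mathcal{S}, \mathcal{A}, T, r, \gamma)\) of an MDP.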

Bellman Equations

Wed, 16 Feb 2022 00:00:00 +0000

Introduction

In Reinforcement Learning (RL), value functions define the objectives of the RL problem. There are two very important and strictly related value functions,

the state-action value function (SAVF) ¹, and
the state value function (SVF).

In this post, I’ll show how these value functions (including their optimality versions) can be mathematically formulated as recursive equations, known as Bellman equations.

I’ll assume that you are minimally familiar with Markov Decision Processes (MDPs) and RL. Nevertheless, I will review the most important RL and mathematical prerequisites to understand this post, so that the post is as self-contained as possible.

Notation

Stylized upper case letters (e.g. \(\mathcal{X}\) or \(\mathbb{R}\)) denote sets or spaces.
Upper case letters (e.g. \(X\)) denote random variables.
Depending on the context, lower case letters (e.g. \(x\)) can denote realizations of random variables, variables of functions, or elements of a vector space.
Depending on the context, \(\color{blue}{p}(s' \mid s, a)\) can be a shorthand for \(\color{blue}{p}(S'=s' \mid S=s, A=a) \in [0, 1]\), a probability, or \(\color{blue}{p}(s' \mid S=s, A=a)\), a conditional probability distribution.
\(X=x\) is an event, which is occasionally abbreviated as \(x\).
\(s' \sim \color{blue}{p}(s' \mid s, a)\) means that \(s'\) is drawn/sampled according to \(\color{blue}{p}\).

Markov Decision Processes

A (discounted) MDP can be defined as a tuple

\[M \triangleq (\mathcal{S}, \mathcal{A}, \mathcal{R}, \color{blue}{p}, \mathscr{r}, \gamma) \tag{1} \label{1},\]

\(\mathcal{S}\) is the state space,
\(\mathcal{A}\) is the action space,
\(\mathcal{R} \subset \mathbb{R}\) is the reward space,
\(\color{blue}{p}(s' \mid s, a)\) is the transition model,
\(\mathscr{r}(s, a) = \mathbb{E}_{p(r \mid s, a)} \left[ R \mid S = s, A = a \right]\) is the expected reward, and
\(\gamma \in [0, 1]\) is the discount factor.

Markov Property

MDPs assume that the Markov property holds, i.e. the future is independent of the past given the present. The Markov property is encoded in \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\).

Finite MDPs

A finite MDP is an MDP where the state \(\mathcal{S}\), action \(\mathcal{A}\) and reward \(\mathcal{R}\) spaces are finite sets. In that case, \(\color{blue}{p}(s' \mid s, a)\) can be viewed as a probability mass function (pmf) and the random variables associated with states, actions and rewards are discrete.

Alternative Formulations

Sometimes, \(\color{blue}{p}(s' \mid s, a)\) is combined with \(\mathscr{r}(s, a)\) to form a joint conditional distribution, \(\color{purple}{p}(s', r \mid s, a)\) (the dynamics of the MDP), from which both \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\) can be derived.

Specifically, \(\color{blue}{p}(s' \mid s, a)\) can be computed from \(\color{purple}{p}(s', r \mid s, a)\) by marginalizing over \(r\) ² as follows

\[\color{blue}{p}(s' \mid s, a) = \sum_{r \in \mathcal{R}} \color{purple}{p}(s', r \mid s, a). \tag{2} \label{2}\]

Similarly, we have

\[\begin{align*} \mathscr{r}(s, a) &= \mathbb{E} \left[ R \mid S = s, A = a \right] \\ &= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \color{purple}{p}(s', r \mid s, a) \\ &= \sum_{r \in \mathcal{R}} r p(r \mid s, a). \end{align*} \tag{3} \label{3}\]
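
As a small numerical sketch of these two relations, the code below takes the joint \(p(s', r \mid s, a)\) for one fixed pair \((s, a)\) as a table and recovers the transition probabilities and the expected reward from it. The class name DynamicsMarginals, the array layout and the numbers in main are assumptions made only for this illustration.

```java
public class DynamicsMarginals {
    // joint[sNext][k] = p(s' = sNext, r = rewards[k] | s, a) for one fixed (s, a).
    // Marginalizing over r: sum each row of the joint over the reward index.
    static double[] transition(double[][] joint) {
        double[] p = new double[joint.length];
        for (int sNext = 0; sNext < joint.length; sNext++) {
            for (double prob : joint[sNext]) {
                p[sNext] += prob;
            }
        }
        return p;
    }

    // Expected reward: each reward weighted by its probability, summed over s' and r.
    static double expectedReward(double[][] joint, double[] rewards) {
        double expected = 0.0;
        for (double[] row : joint) {
            for (int k = 0; k < rewards.length; k++) {
                expected += rewards[k] * row[k];
            }
        }
        return expected;
    }

    public static void main(String[] args) {
        double[][] joint = {{0.125, 0.25}, {0.5, 0.125}}; // entries sum to 1
        double[] rewards = {0.0, 1.0};
        double[] p = transition(joint);
        System.out.println(p[0] + " " + p[1]);
        System.out.println(expectedReward(joint, rewards));
    }
}
```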

Reinforcement Learning

In RL, we imagine that there is an agent that sequentially interacts with an environment in (discrete) time-steps, where the environment can be modelled as an MDP.

More specifically, at time-step \(t\), the agent is in some state \(s_t \in \mathcal{S}\) ³ and takes an action \(a_t \in \mathcal{A}\) with a policy \(\color{red}{\pi}(a \mid s)\), which is a conditional probability distribution over actions given a state, i.e. \(a_t \sim \pi(a \mid s_t)\). At the next time step \(t+1\), the environment returns a reward \(r_{t+1} = \mathscr{r}(s_t, a_t)\), and it moves to another state \(s_{t+1} \sim \color{blue}{p}(s' \mid s_t, a_t)\), then the agent takes another action \(a_{t+1} \sim \color{red}{\pi}(a \mid s_{t+1})\), gets another reward \(r_{t+2} = \mathscr{r}(s_{t+1}, a_{t+1})\), and the environment moves to another state \(s_{t+2}\), and so on. This interaction continues until a maximum time-step \(H\), which is often called horizon, is reached. For simplicity, we assume that \(H = \infty\), so we assume a so-called infinite-horizon MDP.

In RL, the objective/goal is to find a policy that maximizes the sum of rewards in the long run, i.e. until the horizon \(H\) is reached (if ever reached). An objective function that formalizes this sum of rewards in the long run is the state-action value function (SAVF), which is, therefore, one function that we might want to optimize.

State-Action Value Function

The state-action value function for a policy \(\color{red}{\pi}(a \mid s)\) is the function \(q_\color{red}{\pi} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\), which is defined as follows

\[q_\color{red}{\pi}(s, a) \triangleq \mathbb{E} \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right], \\ \color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A} \tag{4}\label{4},\]

where

\(R_{t+k+1}\) is the reward the agent receives at time-step \(t+k+1\),
\(G_t \triangleq \sum_{k=0}^\infty \gamma^k R_{t+k+1}\) is the return (aka value ⁴),
\(\color{red}{\pi}\) is a policy that the agent follows from time-step \(t+1\) onwards,
\(\gamma \in [0, 1)\) is the discount factor,
\(S_t = s\) is the state the agent is in at time-step \(t\), and
\(A_t = a\) is the action taken at time-step \(t\).

Intuitively, \(q_\color{red}{\pi}(s, a)\) is the expected return that the agent gets by following policy \(\color{red}{\pi}\) after having taken action \(a\) in state \(s\) at time-step \(t\).

Value Function of a Policy

The subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) indicates that \(q_\color{red}{\pi}(s, a)\) is defined in terms of \(\color{red}{\pi}(a \mid s)\) because the rewards received in the future, \(R_{t+k+1}\), depend on the actions that we take with \(\color{red}{\pi}(a \mid s)\), but they also depend on the transition model \(\color{blue}{p}(s' \mid s, a)\).

However, \(\color{red}{\pi}\) and \(\color{blue}{p}\) do not appear anywhere inside the expectation in equation \ref{4}. So, for people that only believe in equations, \ref{4} might not be satisfying enough. Luckily, we can express \(q_\color{red}{\pi}(s, a)\) in terms of \(\color{red}{\pi}\) and \(\color{blue}{p}\) by starting from equation \ref{4}, which also leads to a Bellman/recursive equation. So, let’s do it!

Mathematical Prerequisites

The formulation of the value functions as recursive equations (which is the main point of this blog post) uses three main mathematical rules, which are reviewed here for completeness.

Markov Property

If we assume that the Markov property holds, then the following holds

\[\color{blue}{p}(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \color{blue}{p}(s_{t+1} \mid s_t, a_t)\]
\[\mathscr{r}(s_t, a_t) = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 \right] = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t \right]\]
\[\color{red}{\pi}(a_{t} \mid s_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 ) = \color{red}{\pi}(a_{t} \mid s_t)\]

Linearity of Expectation (LE)

Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution, then the expectation of \(X+Y\) is equal to the sum of the expectation of \(X\) and \(Y\), i.e.

\[\begin{align*} \mathbb{E}[X + Y] &= \sum_x \sum_y (x + y) p(x, y) \\ &= \sum_x \sum_y x p(x, y) + y p(x, y) \\ &= \sum_x \sum_y x p(x, y) + \sum_x \sum_y y p(x, y) \\ &= \sum_x x p(x) + \sum_y y p(y) \\ &= \mathbb{E}[X] + \mathbb{E}[Y] \end{align*}\]

Law of Total Expectation (LTE)

The formulation of the LTE is as follows. Let \(X\), \(Y\) and \(Z\) be three discrete random variables, \(\mathbb{E}[X \mid Y=y]\) be the expectation of \(X\) given \(Y=y\), and \(p(x, y, z)\) be the joint of \(X\), \(Y\) and \(Z\). So, we have

\[\begin{align*} \mathbb{E}[X \mid Y=y] &= \sum_x x p(x \mid y) \\ &= \sum_x x \frac{p(x, y)}{p(y)} \\ &= \sum_x x \frac{\sum_z p(x, y, z)}{p(y)} \\ &= \sum_x x \frac{\sum_z p(x \mid y, z) p(y, z) }{p(y)} \\ &= \sum_z \frac{p(y, z)}{p(y)} \sum_x x p(x \mid y, z) \\ &= \sum_z p(z \mid y) \sum_x x p(x \mid y, z) \\ &= \sum_z p(z \mid y) \mathbb{E}[X \mid Y=y, Z=z]. \end{align*}\]

This also applies to other more complicated cases, i.e. more conditions, or even to the simpler case of \(\mathbb{E}[X]\).

State-Action Bellman Equation

We are finally ready to express \(\ref{4}\) as a recursive equation.

We can decompose the return \(G_t\) into the first reward \(R_{t+1}\), received after having taken action \(A_t = a\) in state \(S_t = s\), and the rewards that we will receive in the next time steps, then we can apply the LE, LTE (multiple times) and Markov property, i.e.

\[\begin{align*} q_\color{red}{\pi}(s, a) &= \mathbb{E} \left[ R_{t+1} + \sum_{k=1}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right] \\ &= \mathbb{E} \left[ R_{t+1} \mid S_t = s, A_t = a \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', S_t = s, A_t = a\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') q_\color{red}{\pi}(s', a') \tag{5}\label{5}, \end{align*}\]

where

\(\gamma G_{t+1}=\sum_{k=1}^\infty \gamma^k R_{t+k+1} = \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2}\),
\(\mathscr{r}(s, a) \triangleq \mathbb{E}_{p(r \mid s, a)} \left[ R_{t+1} \mid S_t = s, A_t = a \right] = \sum_{r} p(r \mid s, a) r\) is the expected reward of taking action \(a\) in \(s\) and \(p(r \mid s, a)\) is the reward distribution ⁵,
\(v_\color{red}{\pi}(s') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right]\) is the state value function (SVF), and
\(q_\color{red}{\pi}(s', a') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right]\).

The equation \ref{5} is a recursive equation, given that \(q_\color{red}{\pi}\) is defined in terms of itself (although evaluated at a different state-action pair), known as the state-action Bellman (expectation) equation for \(q_\color{red}{\pi}\).

So, the subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) is used because the state-action value is (also) defined in terms of \(\color{red}{\pi}\).
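
The recursion can also be read as an update rule: start from an arbitrary \(q\) and repeatedly replace it with the right-hand side of the state-action Bellman equation until it stops changing. The following is a minimal sketch of that idea for a finite MDP, not a reference implementation. The class name QPolicyEvaluation, the array layout and the two-state toy MDP with a uniformly random policy are all assumptions made for this illustration.

```java
public class QPolicyEvaluation {
    // Repeatedly applies
    // q(s,a) <- r(s,a) + gamma * sum_{s'} p(s'|s,a) * sum_{a'} pi(a'|s') * q(s',a').
    // p[s][a][s2] = p(s2 | s, a), r[s][a] = expected reward, pi[s][a] = pi(a | s).
    static double[][] evaluate(double[][][] p, double[][] r, double[][] pi,
                               double gamma, int sweeps) {
        int nS = r.length, nA = r[0].length;
        double[][] q = new double[nS][nA];
        for (int i = 0; i < sweeps; i++) {
            double[][] next = new double[nS][nA];
            for (int s = 0; s < nS; s++) {
                for (int a = 0; a < nA; a++) {
                    double future = 0.0;
                    for (int s2 = 0; s2 < nS; s2++) {
                        double v = 0.0; // v_pi(s2) = sum_{a2} pi(a2|s2) q(s2,a2)
                        for (int a2 = 0; a2 < nA; a2++) {
                            v += pi[s2][a2] * q[s2][a2];
                        }
                        future += p[s][a][s2] * v;
                    }
                    next[s][a] = r[s][a] + gamma * future;
                }
            }
            q = next;
        }
        return q;
    }

    // Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1;
    // only (s=0, a=0) is rewarded; the policy picks both actions with probability 0.5.
    static double[][] toyExample() {
        double[][][] p = {{{1, 0}, {0, 1}}, {{1, 0}, {0, 1}}};
        double[][] r = {{1, 0}, {0, 0}};
        double[][] pi = {{0.5, 0.5}, {0.5, 0.5}};
        return evaluate(p, r, pi, 0.5, 200);
    }

    public static void main(String[] args) {
        for (double[] row : toyExample()) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

For this toy MDP, solving the four linear equations by hand gives \(q_\pi(0,0) = 1.375\), \(q_\pi(0,1) = 0.125\), \(q_\pi(1,0) = 0.375\) and \(q_\pi(1,1) = 0.125\), which is what the iteration converges to.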

Alternative Version

Given the relations \ref{2} and \ref{3}, equation \ref{5} can also be expressed in terms of \(\color{purple}{p}(s', r \mid s, a)\) as follows.

\[\begin{align*} q_\color{red}{\pi}(s, a) &= \sum_{r} \underbrace{\sum_{s'} \color{purple}{p}(s', r \mid s, a)}_{p(r \mid s, a)} r + \gamma \sum_{s'} \underbrace{\sum_{r} \color{purple}{p}(s', r \mid s, a)}_{\color{blue}{p}(s' \mid s, a)} v_\color{red}{\pi}(s') \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \tag{6}\label{6} . \end{align*}\]

So, \(q_\color{red}{\pi}(s, a)\) is an expectation with respect to the joint conditional distribution \(\color{purple}{p}(s', r \mid s, a)\), i.e.

\[q_\color{red}{\pi}(s, a) = \mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{red}{\pi}}(S_{t+1}) \mid S_t = s, A_t =a\right] \tag{7}\label{7} .\]

Equation \ref{6} can also be derived from equation \ref{4} by applying the LTE with respect to \(\color{purple}{p}(s', r \mid s, a)\).

Vectorized Form

If the MDP is finite, then we can express the state-action Bellman equation in \ref{5} in a vectorized form

\[\mathbf{Q}_\color{red}{\pi} = \mathbf{R} + \gamma \mathbf{\color{blue}{P}} \mathbf{V}_\color{red}{\pi}, \tag{8}\label{8}\]

where

\(\mathbf{Q}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix that contains the state-action values for each state-action pair \((s, a)\), so \(\mathbf{Q}_\color{red}{\pi}[s, a] = q_{\color{red}{\pi}}(s, a)\),
\(\mathbf{R} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix with the expected rewards for each state-action pair \((s, a)\), so \(\mathbf{R}[s, a] = \mathscr{r}(s, a)\),
\(\mathbf{\color{blue}{P}} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid \times \mid \mathcal{S} \mid}\) is a tensor (a 3-dimensional array) that contains the transition probabilities for each triple \((s, a, s')\), so \(\mathbf{\color{blue}{P}}[s, a, s'] = \color{blue}{p}(S'=s' \mid S=s, A=a)\), and
\(\mathbf{V}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid}\) is a vector that contains the state values (as defined in equation \ref{5}) for each state \(s'\), so \(\mathbf{V}_\color{red}{\pi}[s'] = v_\color{red}{\pi}(s')\).
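
For concreteness, the product on the right-hand side of \ref{8} can be written entrywise. This is only a restatement of equation \ref{5} in index form, assuming that states and actions are enumerated so that they can be used as indices:

\[\begin{align*} \mathbf{V}_\color{red}{\pi}[s'] &= \sum_{a'} \color{red}{\pi}(a' \mid s') \mathbf{Q}_\color{red}{\pi}[s', a'], \\ (\mathbf{\color{blue}{P}} \mathbf{V}_\color{red}{\pi})[s, a] &= \sum_{s'} \mathbf{\color{blue}{P}}[s, a, s'] \mathbf{V}_\color{red}{\pi}[s']. \end{align*}\]

So, \(\mathbf{\color{blue}{P}} \mathbf{V}_\color{red}{\pi}\) contracts the last index of \(\mathbf{\color{blue}{P}}\) and has the same shape as \(\mathbf{Q}_\color{red}{\pi}\) and \(\mathbf{R}\).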

Optimal State-Action Value Function

In RL, the goal is to find/estimate an optimal policy, \(\color{green}{\pi_*}\), i.e. one that, if followed, maximizes the expected return. For a finite MDP, there is a unique optimal state-action value function, which can be denoted by \(q_{\color{green}{\pi_*}}(s, a)\) or just \(q_\color{green}{*}(s, a)\), from which an optimal policy can be derived.

By definition, the optimal state-action value function is

\[q_{\color{green}{\pi_*}}(s, a) \triangleq \operatorname{max}_\color{red}{\pi} q_\color{red}{\pi}(s, a), \\ \color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A} \tag{9}\label{9} .\]

For a discounted infinite-horizon MDP, there is always an optimal policy that is deterministic ⁶ and stationary ⁷: any policy that is greedy with respect to \(q_{\color{green}{\pi_*}}(s, a)\), i.e.

\[\color{green}{\pi_*}(s) \in \operatorname{arg max}_a q_{\color{green}{\pi_*}}(s, a), \\ \color{orange}{\forall} s \in \mathcal{S} \tag{10}\label{10}\]

Here, \(\in\) is used because there can be more than one optimal policy for an MDP given that there can be two or more actions that are optimal in a state.

State-Action Bellman Optimality Equation

Equation \ref{9} can also be written as a recursive equation, known as the state-action Bellman optimality equation.

\[\begin{align*} q_{\color{green}{\pi_*}}(s, a) &\triangleq \operatorname{max}_\color{red}{\pi} q_\color{red}{\pi}(s, a) \\ &= \operatorname{max}_\color{red}{\pi} \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') \right] \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a') \right] \\ &= \mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{green}{\pi_*}}(S_{t+1}) \mid S_t = s, A_t =a\right] \tag{11}\label{11}, \end{align*}\]

where \(v_\color{green}{\pi_*}(s') = \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') = \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a')\).
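
When \(\color{blue}{p}\) and the expected rewards are known, the optimality equation can be turned into an iterative procedure (often called Q-value iteration), followed by a greedy action choice as in equation \ref{10}. The sketch below uses the expected reward \(\mathscr{r}(s, a)\) rather than the joint \(\color{purple}{p}(s', r \mid s, a)\). The class name QValueIteration and the two-state toy MDP are assumptions made only for this illustration.

```java
public class QValueIteration {
    // Repeatedly applies
    // q(s,a) <- r(s,a) + gamma * sum_{s'} p(s'|s,a) * max_{a'} q(s',a').
    static double[][] solve(double[][][] p, double[][] r, double gamma, int sweeps) {
        int nS = r.length, nA = r[0].length;
        double[][] q = new double[nS][nA];
        for (int i = 0; i < sweeps; i++) {
            double[][] next = new double[nS][nA];
            for (int s = 0; s < nS; s++) {
                for (int a = 0; a < nA; a++) {
                    double future = 0.0;
                    for (int s2 = 0; s2 < nS; s2++) {
                        future += p[s][a][s2] * max(q[s2]);
                    }
                    next[s][a] = r[s][a] + gamma * future;
                }
            }
            q = next;
        }
        return q;
    }

    static double max(double[] xs) {
        double best = xs[0];
        for (double x : xs) {
            best = Math.max(best, x);
        }
        return best;
    }

    // One maximizing action per state; ties are broken in favour of the lowest index.
    static int[] greedy(double[][] q) {
        int[] policy = new int[q.length];
        for (int s = 0; s < q.length; s++) {
            for (int a = 1; a < q[s].length; a++) {
                if (q[s][a] > q[s][policy[s]]) {
                    policy[s] = a;
                }
            }
        }
        return policy;
    }

    // Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1;
    // only (s=0, a=0) is rewarded.
    static double[][] toyExample() {
        double[][][] p = {{{1, 0}, {0, 1}}, {{1, 0}, {0, 1}}};
        double[][] r = {{1, 0}, {0, 0}};
        return solve(p, r, 0.5, 200);
    }

    public static void main(String[] args) {
        double[][] q = toyExample();
        System.out.println(java.util.Arrays.deepToString(q));
        System.out.println(java.util.Arrays.toString(greedy(q)));
    }
}
```

In this toy MDP, the optimal behaviour is to go to (or stay in) state 0, so the greedy policy selects action 0 in both states.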

State Bellman Equation

Like \(q_\color{red}{\pi}(s, a)\), the state value function \(v_\color{red}{\pi}(s)\) can also be written as a recursive equation by starting from its definition and then applying the LTE rule, the linearity of the expectation and the Markov property. So, for completeness, let’s do it.

\[\begin{align*} v_\color{red}{\pi}(s) &\triangleq \mathbb{E} \left[ G_t \mid S_t=s\right] \\ &= \mathbb{E} \left[ R_{t+1} + \gamma G_{t+1} \mid S_t=s \right] \\ &= \mathbb{E} \left[ R_{t+1} \mid S_t=s \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t=s \right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1} = s'\right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \left ( \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \right) \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \left ( \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \right) \tag{12}\label{12}. \end{align*}\]

Conclusion

In conclusion, value functions define the objectives of an RL problem. They can be written as recursive equations, known as Bellman equations, in honor of Richard Bellman, who made significant contributions to the theory of dynamic programming (DP), which is related to RL.

More specifically, DP is an approach that can be used to solve MDPs (i.e. to find \(\color{green}{\pi_*}\)) when \(\color{blue}{p}\) is available, although it is not limited to MDPs ⁸. In DP, the solution to a problem is computed by combining the solutions to its subproblems, and the Bellman equation reflects this idea: \(q_\color{red}{\pi}(s, a)\) is computed as a function of the “subproblem” \(q_\color{red}{\pi}(s', a')\). The problem is that \(\color{blue}{p}\) is rarely available, hence the need for RL. In any case, DP algorithms, like policy iteration (PI), and RL algorithms, like Q-learning, are related because they assume that the environment can be modeled as an MDP and they attempt to estimate the same optimal value function.

It’s also called action value function, but I prefer to call it state-action value function because it reminds us that this is a function of a state and action. ↩
Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution. The marginal distribution of \(X\) or \(Y\) can be found as \(p(x) = \sum_y p(x, y)\) and \(p(y) = \sum_x p(x, y)\), respectively. ↩
The subscript \(t\) in the object \(x_t\) is used to emphasize that \(x\) is associated with the time step \(t\). ↩
Hence the name value function. ↩
If the reward is deterministic, then \(p(r \mid s, a)\) gives probability \(1\) to one reward and \(0\) to all other rewards. ↩
\(\pi(a \mid s)\) can also be used to describe deterministic policies by giving a probability of \(1\) to one action in \(s\) and a probability of \(0\) to all other actions. A deterministic policy might also be defined as a function \(\pi : \mathcal{S} \rightarrow \mathcal{A}\), so \(\pi(s) = a\) is the (only) action taken by \(\pi\) in \(s\). ↩
A policy \(\color{red}{\pi}(a \mid s)\) is stationary if it doesn’t change over time steps, i.e. \(\color{red}{\pi}(a \mid S_{t} = s) = \color{red}{\pi}(a \mid S_{t+1} = s), \forall t, \forall s \in \mathcal{S}\), in other words, the probabilities of selecting an action do not change from time step to time step. You can think of a non-stationary policy as a set of policies. ↩
For more info about the dynamic programming approach, I recommend that you read the corresponding chapter in the book Introduction to Algorithms (3rd edition) by Thomas H. Cormen et al. ↩

MDPs are POMDPs

Sat, 01 Jan 2022 00:00:00 +0000

A (fully observable) Markov Decision Process (MDP) is just a Partially Observable Markov Decision Process (POMDP) where the states are observable. So, we can formulate an MDP as a POMDP such that the observation space is equal to the state space. We also need to take care of the observation function. Let’s see how exactly.

Formally, an MDP can be defined as a tuple \(M_\text{MDP} = (\mathcal{S}, \mathcal{A}, T, r, \gamma)\), where

\(\mathcal{S}\) is the state space
\(\mathcal{A}\) is the action space
\(T = p(s' \mid s, a)\) is the transition function
\(r\) is the reward function
\(\gamma\) is the discount factor

A POMDP is defined as a tuple \(M_\text{POMDP} = (\mathcal{S}, \mathcal{A}, T, r, \gamma, \color{red}{\Omega}, \color{red}{O})\), where \(\mathcal{S}\), \(\mathcal{A}\), \(T\), \(r\) and \(\gamma\) are defined as above, but, in addition to those, we also have

\(\color{red}{\Omega}\): the observation space
\(\color{red}{O} = p(o \mid s', a)\): the observation function, which is the probability distribution over possible observations, given the next state \(s'\) and action \(a\)

So, to define \(M_\text{MDP}\) as \(M_\text{POMDP}\), we have

\[\color{red}{\Omega} = \mathcal{S}\]
The observation function is \(\color{red}{O} = p(o \mid s', a) = \begin{cases} 1, \text{ if } o = s' \\ 0, \text{ otherwise } \end{cases}\)

In other words, the probability of observing \(o = s'\), given that we end up in \(s'\), is \(1\), while the probability of observing \(o \neq s'\) is \(0\). This has implications on how you update the belief state \(b(s')\) because \(b(s')\) will be set to \(0\) if \(o \neq s'\).
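
To see why, recall the standard POMDP belief update, which is not stated above. It is written here in the notation of this post, with \(b\) denoting the current belief and \(b'\) the belief after taking action \(a\) and observing \(o\) (the symbol \(b'\) is introduced only for this sketch):

\[b'(s') = \frac{p(o \mid s', a) \sum_{s \in \mathcal{S}} p(s' \mid s, a) b(s)}{\sum_{s'' \in \mathcal{S}} p(o \mid s'', a) \sum_{s \in \mathcal{S}} p(s'' \mid s, a) b(s)}.\]

With the observation function defined above, the numerator is zero unless \(s' = o\), so, whenever the observation \(o\) has non-zero probability, \(b'(s') = 1\) if \(s' = o\) and \(b'(s') = 0\) otherwise: the belief collapses onto the observed state.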

Historically relevant programs developed in LISP

Sat, 04 Dec 2021 00:00:00 +0000

LISP stands for List Processing. In this functional programming language, programs look like lists and can be treated as data (hence the name) ¹. It was designed by John McCarthy (one of the official founders of the AI field) starting in 1958.

Many people know that LISP is historically a very important programming language in Artificial Intelligence. Even today, dialects of LISP are still being used in this context. For example, Clojush is a Clojure (which is a dialect of LISP) implementation of the Push programming language and the PushGP system, which are still being used to do research on genetic programming.

Many historically relevant programs were implemented in LISP in the early days of AI. Here’s a non-exhaustive list ² ³.

Name	Author	Source	Year	Brief description/comment
Symbolic Automatic INTegrator (SAINT)	James R. Slagle	[1]	1963	A symbolic integration program
ANALOGY	Thomas G. Evans	[2]	1964	It solves geometric analogy problems
Semantic Information Retrieval (SIR)	Bertram Raphael	[3]	1964	A “machine understanding” program
QA3	C. Cordell Green (and Robert Yates)	[4]	1969	A resolution-based deduction system, which was an attempt to improve on Raphael’s SIR; QA3 is the successor of QA2 and QA1
SEE	Adolfo Guzman-Arenas	[5]	1969	A program to segment a line drawing of a scene containing blocks into its constituents
DENDRAL	Edward Feigenbaum, Joshua Lederberg, Bruce Buchanan, Carl Djerassi, and others	[6], [7], [8]	1965-	A project, expert system or series of programs to help chemists identify the structure of molecules given their mass spectra and other expert knowledge
Stanford Research Institute Problem Solver (STRIPS)	Richard Fikes & Nils Nilsson	[9]	~1970	A planning system used in the Shakey robot
SHRDLU	Terry Winograd	[10]	1971	An NLP dialog system, which was only partially written in LISP
MYCIN	Edward (Ted) Shortliffe	[11]	~1970	An expert system that would consult with physicians about bacterial infections and therapy; -mycin is a common suffix for antibacterial agents [12]; the specific version of LISP used was BBN-LISP
Language Interface Facility with Elliptical and Recursive Features (LIFER)	Gary Hendrix	[13]	1976	A program to interact with databases in a subset of natural language (e.g. English); the specific version of LISP used was INTERLISP, a successor of BBN-LISP

In addition to these programs, many of the implementations of the conceptual structures by Roger C. Schank were in LISP [8].

Later, LISP was also used by John Koza in the context of GP (but this was already in the 90s). In 1998, NASA also developed the “Remote Agent” (RA), a robotic system for planning and executing spacecraft actions, in LISP Works, in the context of Deep Space 1 [8].

If you are aware of any LISP program developed in the early days of AI (50s-90s) that is not mentioned above, you can share it with us in the comment section below and I will include it in the table above.

I am not a LISP programmer, but 3-4 years ago I implemented a simple plugin for Emacs in Emacs Lisp. ↩
Most of these programs are mentioned in the book The Quest for Artificial Intelligence: A History of Ideas and Achievements, (2009) by Nils J. Nilsson, which I’ve been reading and enjoying. ↩
Not all of these programs were fully implemented in LISP, and it’s possible that there are also other implementations of these programs in other programming languages. ↩

Optimal value function of shifted rewards

Sun, 01 Nov 2020 00:00:00 +0000

Theorem

Consider the following Bellman optimality equation (BOE) (equation 3.20 of Sutton & Barto book on RL, 2nd edition, p. 64)

\[q_*(s,a) =\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)\tag{1}\label{1}.\]

If we add the same constant \(c \in \mathbb{R}\) to all rewards \(r \in \mathcal{R}\), then the new optimal state-action value function is given by

\[q_*(s, a) + k,\]

where

\[k = \frac{c}{1 - \gamma} = c\left(\frac{1}{1 - \gamma}\right) = c \left( \sum_{i=0}^{\infty} \gamma^{i} \right) = c \left( 1 + \gamma + \gamma^2 + \gamma^3 + \dots \right),\]

where \(0 \leq \gamma < 1\) is the discount factor of the MDP and \(\sum_{i=0}^{\infty} \gamma^{i}\) is a geometric series ¹.

Assumptions

\(0 \leq \gamma < 1\); if we allowed \(\gamma = 1\), then \(\frac{c}{1 - \gamma} = c/0\), which is undefined.
For episodic problems ², we assume that we have an absorbing state \(s_\text{absorbing}\) ³, which is the state that the agent moves to after it has reached the goal, where the agent gets a reward of \(0\) for all future time steps. So, \(q_*(s_\text{absorbing}, a) =0, \forall a \in\mathcal{A}(s_\text{absorbing})\).

Proof

To show this, we need to show that the following equation is equivalent to the BOE in \ref{1}.

\[q_*(s,a) + k = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \gamma \max_{a' \in\mathcal{A}(s')} \left( q_*(s',a') + k \right) \right) \tag{2}\label{2}\]

Given that \(k = \frac{c}{1 - \gamma}\) is a constant, it does not affect the max, because we add this constant to all state-action values: this holds even if \(c\) is negative! So, we can take \(k\) out of the max and add it to \(\max_{a'\in\mathcal{A}(s')} q_*(s',a')\)

\[\begin{align*} q_*(s,a) + k &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \gamma \left (k + \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \frac{c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left(r + \frac{c(1 - \gamma) + c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left(r + \frac{c - c\gamma + c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left ( p(s',r \mid s,a)\frac{c}{1 - \gamma} \right) + \\ & \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left( p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \right). \tag{3}\label{3} \end{align*}\]

Given that \(p(s',r \mid s,a)\) is a probability distribution, then \(\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left ( p(s',r \mid s,a)\frac{c}{1 - \gamma} \right)\) is the expectation of the constant \(\frac{c}{1 - \gamma}\), which is equal to the constant itself.

So, equation \ref{3} becomes

\[q_*(s,a) + \frac{c}{1 - \gamma} = \frac{c}{1 - \gamma} + \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ \iff \\ q_*(s,a) =\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)\]

which is the Bellman optimality equation \ref{1}.
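As a sanity check, the result can also be verified numerically. The following is only a minimal sketch: the small MDP (its transition probabilities `P` and rewards `R`), the shift `c` and the number of sweeps are all made up for illustration, and the reward is assumed to be a deterministic function of the state and action, which is a special case of \(p(s',r \mid s,a)\).

```python
def q_value_iteration(P, R, gamma, sweeps=2000):
    """Approximate q_* by repeatedly applying the Bellman optimality backup.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    (deterministic) reward for taking action a in state s.
    """
    q = [[0.0 for _ in row] for row in R]
    for _ in range(sweeps):
        q = [[R[s][a] + gamma * sum(p * max(q[s2]) for p, s2 in P[s][a])
              for a in range(len(R[s]))]
             for s in range(len(R))]
    return q


def greedy(q):
    """Index of the highest-valued action in every state."""
    return [row.index(max(row)) for row in q]


# Hypothetical MDP with 3 states and 2 actions per state.
P = [
    [[(0.8, 1), (0.2, 0)], [(1.0, 2)]],
    [[(1.0, 0)], [(0.5, 1), (0.5, 2)]],
    [[(1.0, 2)], [(0.3, 0), (0.7, 1)]],
]
R = [[0.0, 1.0], [2.0, -1.0], [0.5, 0.0]]

gamma, c = 0.9, -3.0      # c is negative on purpose
k = c / (1 - gamma)

q = q_value_iteration(P, R, gamma)
q_shifted = q_value_iteration(P, [[r + c for r in row] for row in R], gamma)

# Largest deviation between the shifted solution and q_* + k.
max_gap = max(abs(q_shifted[s][a] - (q[s][a] + k))
              for s in range(len(R)) for a in range(len(R[s])))
print(max_gap)
print(greedy(q) == greedy(q_shifted))
```

If the theorem holds, `max_gap` should be zero up to floating-point error, and the greedy policies with respect to the two value functions should coincide.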

Interpretation

The result above suggests that, if we add a constant to all rewards, which is a form of reward shaping, the set of optimal policies does not change.

Is this always true? Yes, in theory.

However, we must be careful with episodic problems.

In theory, after we shift the rewards by \(c\), the agent will precisely get an additional reward of \(k = \frac{c}{1 - \gamma}\) for being in any state, including the absorbing state, and taking any action. So, after we shift the rewards, we have \(q_*(s_\text{absorbing}, a) = \frac{c}{1 - \gamma}, \forall a \in\mathcal{A}(s_\text{absorbing})\).
In practice, however, we might misspecify the reward function if we shift the rewards and still terminate the episode once the agent gets to \(s_\text{absorbing}\).

Example

To illustrate this issue, let’s say that, for an episodic problem (for example, a problem where the agent is in a grid and needs to go to a goal location), we have the following (deterministic) reward function

\[r(s, a) = \begin{cases} 1, \text{if } s = s_\text{goal}\\ 0, \text{if } s = s_\text{absorbing}\\ 0, \text{otherwise} \\ \end{cases}\]

\(s_\text{absorbing}\) is just the state that we assume the agent moves to after having reached the goal state, where it continues to get a reward of \(0\) forever. This assumption is what allows us to terminate the episode once we get to \(s_\text{goal}\).

Now, let’s say that we define a new reward function as \(r'(s, a) \triangleq r(s, a) - 1\), i.e.

\[r'(s, a) = \begin{cases} 0, \text{if } s = s_\text{goal}\\ -1, \text{if } s = s_\text{absorbing}\\ -1, \text{otherwise} \\ \end{cases}\]

So, in theory, with \(r'\), the agent no longer gets a reward of \(0\) after reaching the goal: it gets \(-1\) at every step spent in the absorbing state. If you terminate the episode once the agent gets to the goal, this is not taken into account, i.e. you implicitly assume that \(r'(s_\text{absorbing}, a) = 0, \forall a \in \mathcal{A}(s_\text{absorbing})\), so you’re actually optimizing

\[r''(s, a) = \begin{cases} 0, \text{if } s = s_\text{goal}\\ 0, \text{if } s = s_\text{absorbing}\\ -1, \text{otherwise} \\ \end{cases}\]

So, in practice, you might be optimizing a different objective function than the one you implicitly or unconsciously assumed. In this example, \(r''(s, a)\) is the reward function that encourages the agent to get to the goal as quickly as possible (because you get a penalty of \(-1\) for every time step at which you have not yet reached the goal). So, \(r''(s, a)\) might actually be what you want to optimize, but, in general, you must be careful with reward misspecification ⁴!
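To make the difference between \(r\), \(r'\) and \(r''\) concrete, here is a small sketch that computes the discounted return of a trajectory that reaches the goal after a given number of steps. The numbers of steps, the discount factor and the finite horizon (used to truncate the infinite sum) are assumptions chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite list of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def trajectory(steps_to_goal, step_r, goal_r, absorbing_r, horizon):
    """Rewards for the ordinary steps, one step in the goal, then the absorbing state."""
    tail = horizon - steps_to_goal - 1
    return [step_r] * steps_to_goal + [goal_r] + [absorbing_r] * tail


gamma, horizon = 0.9, 2000  # the horizon truncates the infinite sum (assumption)
k = -1 / (1 - gamma)        # shift c = -1, so k = c / (1 - gamma)

returns = {}
for n in (2, 5):  # hypothetical numbers of steps needed to reach the goal
    returns[n] = {
        "r": discounted_return(trajectory(n, 0, 1, 0, horizon), gamma),
        "r_shifted": discounted_return(trajectory(n, -1, 0, -1, horizon), gamma),
        "r_terminated": discounted_return(trajectory(n, -1, 0, 0, horizon), gamma),
    }

for n, g in returns.items():
    # Offset of each return with respect to the return under the original r.
    print(n, g["r_shifted"] - g["r"], g["r_terminated"] - g["r"])
```

Under \(r'\), the offset with respect to the original return should be the same constant \(k\) for both trajectories, as the theorem predicts. Under \(r''\) (i.e. \(r'\) with early termination), the offset depends on how long the agent takes to reach the goal, so \(r''\) is not just a constant shift of \(r\).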

\(k\) is also written as a geometric series to emphasize that \(k\) is similar to the discounted return, which is defined as \(G_t = \sum_{i=0}^{\infty} \gamma^{i}R_{t+1+i}\), where \(R_{t+1+i}\) is the reward at time step \(t+1+i\). If all rewards were equal to \(c\), then \(G_t = \frac{c}{1 - \gamma}\). ↩
See sections 3.3 and 3.4. (p. 54) of Sutton & Barto book (2nd edition) for more details about the difference between episodic and continuing problems and how they can be unified. ↩
The assumption of having an absorbing state is also made in Policy invariance under reward transformations: Theory and application to reward shaping (1999) by Andrew Y. Ng et al., which is a seminal paper on reward shaping, which cites the book Theory of Games and Economic Behavior (1944) by John von Neumann et al. to support the claim that, for single-step decisions (which I assume to be some kind of bandit problem), positive linear transformations of the utility function do not change the optimal decision/policy: if we combine the theorem in this blog post and the theorem in my previous blog post, we get a similar result. ↩
This idea of reward misspecification has been studied in the literature. For example, in the paper, Inverse Reward Design (2017) by Dylan Hadfield-Menell et al. the authors propose an approach to deal with proxy reward functions (i.e. the reward functions designed by the human, which might not be the reward functions that the human intended to define). ↩

On the definition of intelligence

Wed, 20 May 2020 00:00:00 +0000

Introduction

Many people claim that we still do not agree on a definition of intelligence (and thus on what constitutes an artificial intelligence), with the usual argument that intelligence means something different to different people or that we still do not understand everything about (human or animal) intelligence. In fact, in the article What is artificial intelligence? (2007), John McCarthy, one of the official founders of the AI field, states

The problem is that we cannot yet characterize in general what kinds of computational procedures we want to call intelligent. We understand some of the mechanisms of intelligence and not others.

To understand all mechanisms of intelligence, some people, such as Jeff Hawkins, have been studying the human brain (which is the main example of a system that is associated with intelligence).

We might not know how we are intelligent (i.e. how the human brain makes us intelligent), but this does not mean that we can’t come up with a general definition of intelligence that comprises all forms of intelligence (that people could possibly refer to). In other words, you do not need to fully understand all mechanisms of intelligence in order to attempt to provide a general definition of intelligence. For example, theoretical physicists (such as Albert Einstein) do not need to understand all the details of physics in order to come up with general laws of physics that are applicable in most cases and that explain many phenomena.

Universal Intelligence

There has been at least one quite serious attempt to formally define intelligence (and machine intelligence), so that it comprises all forms of intelligence that people could refer to.

In the paper Universal Intelligence: A Definition of Machine Intelligence (2007), Legg and Hutter, after having researched many previously given definitions of intelligence, informally define intelligence as follows

Intelligence measures an agent’s ability to achieve goals in a wide range of environments

This definition favors systems that are able to solve many tasks, which are often known as artificial general intelligences (AGIs), over systems that are only able to solve a specific task, sometimes known as narrow AIs.

Mathematical Formalization

To understand why this is the case, let’s look at their simple mathematical formalization of this definition (section 3.3 of the paper)

\[\Gamma(\pi) := \sum_{\mu \in E} \frac{1}{2^{K(\mu)}} V_{\mu}^{\pi}\]

where

\(\Gamma(\pi)\) is the universal intelligence of agent \(\pi\)
\(E\) is the space of all computable reward summable environmental measures with respect to the reference machine \(U\) (roughly speaking, the space of all environments)
\(\mu\) is the environment (or task/problem)
\(V_{\mu}^{\pi}\) is the ability of the agent \(\pi\) to achieve goals in the environment \(\mu\)
\(K(\mu)\) is the Kolmogorov complexity of the environment \(\mu\)

Interpretation

We can immediately notice that the intelligence of an agent is a weighted combination of the ability to achieve goals in the environments (which represent the tasks/problems to be solved), where each weight \(\frac{1}{2^{K(\mu)}}\) decreases exponentially with the complexity of the environment (i.e. the difficulty of describing/solving the corresponding task). In other words, \(\Gamma(\pi)\) is defined as an expectation of \(V_{\mu}^{\pi}\) with respect to the probability distribution \(\frac{1}{2^{K(\mu)}}\), which Legg and Hutter call the universal distribution.

So, the higher the complexity of an environment, the less the ability of the agent to achieve goals in this environment contributes to the intelligence of the agent. In other words, the ability to solve a very difficult task successfully might not be enough to have high intelligence. You can have higher intelligence by solving many but simpler problems. Of course, an intelligent agent that solves all tasks optimally would be the optimal or perfect agent. AIXI, developed and formalized by Hutter, is actually an optimal agent (in some sense), but, unfortunately, it is incomputable (because it uses the Kolmogorov complexity)¹.

Consequently, according to this definition, we could say that all animals (and maybe even other biological organisms) are more intelligent than, for example, AlphaGo or DeepBlue, because all animals solve many problems, although they might not be as difficult as Go, while AlphaGo only solves Go ².
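To get a feeling for how the weights behave, here is a toy computation of the weighted sum. Keep in mind that the Kolmogorov complexity is incomputable, so the complexities, the environment names and the values \(V_{\mu}^{\pi}\) below are completely made up: this is only a sketch of the arithmetic, not a way to measure the intelligence of a real agent.

```python
def universal_intelligence(values, complexities):
    """Weighted sum of the values V over environments, with weights 2^(-K)."""
    return sum(2.0 ** -complexities[env] * values[env] for env in complexities)


# Hypothetical complexities K(mu) of four environments.
complexities = {"simple_a": 3, "simple_b": 4, "simple_c": 5, "hard_game": 20}

# A "narrow" agent that is perfect at one complex task and useless elsewhere,
# and a "general" agent that is decent at several simpler tasks.
narrow_agent = {"simple_a": 0.0, "simple_b": 0.0, "simple_c": 0.0, "hard_game": 1.0}
general_agent = {"simple_a": 0.6, "simple_b": 0.6, "simple_c": 0.6, "hard_game": 0.0}

score_narrow = universal_intelligence(narrow_agent, complexities)
score_general = universal_intelligence(general_agent, complexities)
print(score_narrow, score_general)
```

With these made-up numbers, the general agent gets a much higher score than the narrow one, because the single complex environment has a tiny weight compared to the simpler ones.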

Open Questions

I like this definition of universal intelligence because it implies that humans (and other animals) are more (generally) intelligent than AlphaGo or any other computer program, but it raises at least two questions:

How would we measure the difficulty of a real-world environment?
So, in practice, can we really compare an animal with AlphaGo? Yes, we can with intelligence tests like the Turing test, but can we do it with \(\Gamma(\pi)\)? The answer to this question clearly depends on the answer to the question above.

Intelligence Tests

In the paper, they also discuss issues like intelligence tests and their relation to the definition of intelligence: that is, is an intelligence test sufficient to define intelligence, or are an intelligence test and a definition of intelligence distinct concepts?

Conclusion

In my view, it is unproductive to come up with new definitions of intelligence (unless they are more generally applicable than universal intelligence) or to avoid choosing one definition with the excuse that we don’t know what intelligence is. I know what intelligence is. It’s measured by \(\Gamma(\pi)\). So, I don’t need to know how we can create an agent that is (highly) intelligent before I know what intelligence is. It’s not a matter of liking a definition or not: it’s a matter of defining a set of axioms or hypotheses and, respectively, deriving other properties from them or testing those hypotheses.

\(\Gamma(\pi)\) is also a function of the Kolmogorov complexity, but this is just a definition, i.e. it does not directly give you the instructions to develop intelligent agents. ↩
Note that, according to this definition, AlphaGo is still intelligent, but just not as intelligent as animals. ↩

Optimal value function of scaled rewards

Sun, 15 Sep 2019 00:00:00 +0000

Theorem

Consider the following Bellman optimality equation (BOE) (equation 3.20 of Sutton & Barto book on RL, 2nd edition, p. 64)

\[q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')}q_*(s',a')) \tag{1}\label{1}.\]

If we multiply all rewards by the same constant \(c \in \mathbb{R}\), with \(c > 0\), then the new optimal state-action value function is given by

\[cq_*(s, a).\]

Proof

To prove this, we need to show that the following BOE

\[c q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma \max_{a'\in\mathcal{A}(s')} c q_*(s',a')). \tag{2}\label{2}\]

is equivalent to the BOE in equation \ref{1}.

Given that \(c > 0\), then

\[\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\max_{a'\in\mathcal{A}(s')}q_*(s',a'),\]

so \(c\) can be taken out of the \(\operatorname{max}\) operator.

Therefore, the equation \ref{2} becomes

\[\begin{align*} c q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma c \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}c p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= c \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &\iff \\ q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')), \end{align*} \tag{3}\label{3}\]

which is equal to the Bellman optimality equation in \ref{1}. This implies that, when the reward is given by \(cr\), \(c q_*(s,a)\) is the solution to the Bellman optimality equation.

Interpretation

Consequently, whenever we multiply the reward function by some positive constant, which can be viewed as a form of reward shaping ¹, the set of optimal policies does not change ².

What if the constant is zero or negative?

For completeness, if \(c=0\), then \ref{2} becomes \(0=0\), which is true: all rewards are zero, so all state-action values are zero and every policy is optimal.

If \(c < 0\), then \(\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\min_{a'\in\mathcal{A}(s')}q_*(s',a')\), so equation \ref{3} becomes

\[q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \min_{a'\in\mathcal{A}(s')} q_*(s',a')),\]

which is not equal to the Bellman optimality equation in \ref{1}.
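Both cases can be checked numerically on a toy problem. The following sketch uses a made-up MDP (the transition probabilities `P`, the rewards `R` and the constants are assumptions for illustration only) and a plain value iteration on the state-action values, with rewards that are deterministic given the state and action.

```python
def q_value_iteration(P, R, gamma, sweeps=2000):
    """Approximate q_* by repeatedly applying the Bellman optimality backup.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    (deterministic) reward for taking action a in state s.
    """
    q = [[0.0 for _ in row] for row in R]
    for _ in range(sweeps):
        q = [[R[s][a] + gamma * sum(p * max(q[s2]) for p, s2 in P[s][a])
              for a in range(len(R[s]))]
             for s in range(len(R))]
    return q


def greedy(q):
    """Index of the highest-valued action in every state."""
    return [row.index(max(row)) for row in q]


def max_gap(q_new, q_old, c):
    """Largest deviation between q_new and c * q_old."""
    return max(abs(q_new[s][a] - c * q_old[s][a])
               for s in range(len(q_old)) for a in range(len(q_old[s])))


def scaled(R, c):
    """Reward table with every reward multiplied by c."""
    return [[c * r for r in row] for row in R]


# Hypothetical MDP with 2 states and 2 actions per state.
P = [
    [[(1.0, 0)], [(0.6, 1), (0.4, 0)]],
    [[(1.0, 1)], [(1.0, 0)]],
]
R = [[1.0, 0.0], [2.0, -0.5]]
gamma = 0.9

q = q_value_iteration(P, R, gamma)
q_pos = q_value_iteration(P, scaled(R, 2.5), gamma)   # c > 0
q_neg = q_value_iteration(P, scaled(R, -1.0), gamma)  # c < 0

print(max_gap(q_pos, q, 2.5), greedy(q_pos) == greedy(q))
print(max_gap(q_neg, q, -1.0), greedy(q_neg) == greedy(q))
```

For the positive constant, the gap should be zero up to floating-point error and the greedy policy should be unchanged. For the negative constant, the gap is large and, in this particular toy MDP, the greedy policy also changes.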

A seminal paper on reward shaping is Policy invariance under reward transformations: Theory and application to reward shaping (1999) by Andrew Y. Ng et al. ↩
There can be more than one optimal policy for a given optimal value function (and Markov Decision Process) because we might have \(q_*(s,a_1) = q_*(s,a_2) \geq q_*(s,a), \text{for } a_1, a_2 \in \mathcal{A}(s) \text{ and } \forall a \in \mathcal{A}(s)\). Any greedy policy with respect to the optimal value function \(q_*(s,a)\) is an optimal policy. ↩

An example of how to use VisualDL with PyTorch

Sun, 06 Jan 2019 00:00:00 +0000

Abstract

In this blog post, I will describe my journey while looking for visualization tools for PyTorch. In particular, I will briefly describe the options I tried out, and why I opted for VisualDL. Finally, and more importantly, I will show you a simple example of how to use VisualDL with PyTorch, both to visualize the parameters of the model and to read them back from the file system, in case you need them, e.g. to plot them with another tool (e.g. with Matplotlib).

Introduction

Yesterday, I tried to find and use a visualization tool, similar to TensorBoard, but for PyTorch. I find this type of visualization tool very useful, because it allows me to intuitively understand how the model is behaving and, in particular, how certain parameters and hyper-parameters of the model are changing, while the model is being trained and tested. So, these tools are especially useful while debugging our programs or if one needs to present the model to other people (e.g. teammates) while it is being trained or tested.

There still isn’t a “standard” tool for visualization in PyTorch (AFAIK). However, there are several “decent” options: TNT, tensorboardX or VisualDL. There may be other options, but these are, apparently, the most popular (that I found), according to the stars of the corresponding Github repositories.

If you perform a search on the web, you will find discussions and questions regarding visualization tools for PyTorch, for example this, this and this.

The options

I tried all three options I mentioned above.

TNT

I first tried to use TNT, which can be used for logging purposes during the training and testing process of our models. TNT actually uses Visdom (which is a quite flexible and general visualization tool created by Facebook) to display the info in the form of plots. Therefore, you also need Visdom as a dependency.

More specifically, I tried to run this TNT example. To successfully run the mentioned example using PyTorch 1.0.0, you need to modify a statement (which was already deprecated, but nobody cared to update the example), which causes an error. See this Github issue (on the TNT Github repo) for more info. I didn’t much like this example (and, by extension, TNT) because the logic of the program changed drastically (with respect to a “usual” PyTorch program) only to visualize the evolution of e.g. the training loss. To me, this seemed like a sign of inflexibility. Therefore, I tried to look for other options.

tensorboardX

I then tried to use tensorboardX, which is, right now, among the three options, the one with the highest popularity (in terms of Github stars).

There are several examples which show how to use this tool. You can find them here and here. In particular, I tried this example. To use tensorboardX, we actually need to install TensorBoard and TensorFlow. Apart from the fact that these are not lightweight dependencies, this was also my first time using PyTorch, and the thought of needing to use TensorFlow (which I previously used in other projects) to achieve something in PyTorch made me think that I’d better just stick with TensorFlow and give up on PyTorch. Of course, this doesn’t make much sense, but I just didn’t feel this was the right direction. Therefore, I looked for another option.

VisualDL

Finally, I tried VisualDL (Visual Deep Learning), which is essentially a visualization tool very similar to TensorBoard, whose backend is written in C++, but with both C++ and Python APIs. Its frontend or web interface is written in Vue. You can find its documentation at http://visualdl.paddlepaddle.org/. Nowadays, most machine learning frameworks and libraries are written in C++ and have a Python API, so, of course, this characteristic of VisualDL seems consistent with many other machine learning tools. One of the goals of this visualization tool is to be “cross-framework”, i.e. not to be tailored to a specific framework (like TensorFlow or PyTorch). I like flexibility, therefore this feature immediately biased me towards VisualDL. On the official website of the tool, it is claimed that VisualDL works with Caffe2, PaddlePaddle, PyTorch, Keras and MXNet. I imagine that in the future more frameworks will be supported.

I first read the README file of the Github repo of VisualDL, the documentation, this tutorial and this example. There are other examples (for other frameworks) you can find here. VisualDL seems to be in its preliminary phases, but you can already accomplish several things that you expect to accomplish with e.g. TensorBoard. For example, you can visualize (more or less in real-time) the evolution of scalar values of your model (e.g. the learning rate or the training loss), you can also plot histograms and visualize the computational graph.

However, this blog post is not dedicated to the explanation and presentation of all features of VisualDL, therefore I leave it to the reader to explore the remaining features of VisualDL. In the next section, I will show you a brief example of how to use VisualDL with PyTorch and how to read the logging data, once it has been logged (and possibly visualized), using the API that VisualDL already provides.

How can VisualDL be used to visualize statistics of PyTorch models?

Before proceeding, you need to install PyTorch and VisualDL. In this example, whose source code can be found at https://github.com/nbro/visualdl_with_pytorch_example, I installed PyTorch and VisualDL in an Anaconda environment, but you can install them as you please. If you want to follow along with my example exactly, please read the instructions here on how to set up your environment and run the example.

I will not describe all the details inside this example, but only the ones associated with the usage of VisualDL with PyTorch.

In the file write_visualdl_data.py, to use VisualDL to visualize the evolution of some of the metrics or statistics (specifically, the training loss, the test loss and the test accuracy) of the associated model (a CNN trained and tested on the MNIST dataset), I first imported the class LogWriter from visualdl (line 11):

from visualdl import LogWriter

I then created a LogWriter object (at line 13)

log_writer = LogWriter("./log", sync_cycle=1000)

where "./log" is the name of the folder where the logging files will be placed and sync_cycle is a parameter which controls when VisualDL will force the logging to be written to the file system. Have a look at the documentation for more info.

Then, at lines 158 and 161, I defined the specific loggers (which are of type “scalars”, given that the training loss, the test loss and the test accuracy are scalar values) which will be used to record statistics during the training and testing phases, respectively:

with log_writer.mode("train") as logger:
    train_losses = logger.scalar("scalars/train_loss")

with log_writer.mode("test") as logger:
    test_losses = logger.scalar("scalars/test_loss")
    test_accuracies = logger.scalar("scalars/test_accuracy")

What this piece of code tells us is that, under the “mode” "train", we are defining the scalar logger train_losses, which is associated with the “tag” "scalars/train_loss". The same applies to the loggers associated with the mode "test". VisualDL is actually aware of these modes: they will then be useful to retrieve the logging data from the file system (we will see this in the next example).

The specific scalar loggers train_losses, test_losses and test_accuracies are then passed to the functions train and test at lines 168 and 169. The functions train and test are called at every epoch (inside a loop). Inside the train function, at line 61, we add a “record” to the train_losses logger using:

train_losses.add_record(epoch, float(loss.item()))

Similarly, inside test, at lines 86 and 87, we add records for the test_losses and test_accuracies loggers, respectively.

test_losses.add_record(epoch, float(test_loss))
test_accuracies.add_record(epoch, float(test_accuracy))

The first argument of the add_record method is an “id” (here, the epoch), which is basically a key that will be needed later to match each record (which, in the examples above, is either a loss or an accuracy value) with the epoch at which it was logged. I converted the record values using the function float to make sure they are all floating-point values.
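To illustrate only the pattern (one record per epoch, keyed by the epoch), here is a sketch that runs without VisualDL installed. The class FakeScalarLogger is a hypothetical stand-in that I made up for this illustration: it is not part of VisualDL, it only mimics the add_record(id, value) call used above, and the "loss" is a placeholder rather than something computed from a model.

```python
class FakeScalarLogger:
    """Hypothetical stand-in for a scalar logger (NOT the real VisualDL class)."""

    def __init__(self, tag):
        self.tag = tag
        self._ids = []
        self._records = []

    def add_record(self, record_id, value):
        # Store the id (here, the epoch) next to the recorded value.
        self._ids.append(record_id)
        self._records.append(float(value))

    def ids(self):
        return list(self._ids)

    def records(self):
        return list(self._records)


def train(epoch, train_losses):
    # Placeholder loss that shrinks with the epoch; a real training loop
    # would compute it from the model and the data.
    loss = 1.0 / epoch
    train_losses.add_record(epoch, loss)


train_losses = FakeScalarLogger("scalars/train_loss")
for epoch in range(1, 4):
    train(epoch, train_losses)

# Each record can later be matched with the epoch at which it was logged.
pairs = list(zip(train_losses.ids(), train_losses.records()))
print(pairs)
```

The point is simply that the logger is created once, passed to the function that runs at every epoch, and fed one (id, value) pair per call, so that the ids can be used later as the x-axis of a plot.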

These are the only lines of code I needed to add to the original PyTorch program to obtain logging and visualization functionalities using VisualDL. More specifically, I added about 10 lines, and these lines are quite self-explanatory.

The following picture shows the resulting web interface of VisualDL, after having executed this example and after having waited for the log folder to be created and containing some logging files produced by VisualDL (as explained in this README file):

The screenshot does not completely show the bottom plot, but, of course, in the VisualDL web interface, you can scroll down. You can even expand single plots, among other things.

How can we read the logging data produced by VisualDL?

During the training and testing phases of your model, VisualDL will produce some logging files, in our case, under the folder log. These files are in a format which is not human-readable. They are files associated with ProtoBuf (you can ignore this!).

Anyway, VisualDL also allows us to read these files using its API. We may want to do this because we may need to produce Matplotlib plots using the generated data during the training and testing phases.

More specifically, we can read these logging data (previously logged to a file using LogWriter, as explained in the previous example) using LogReader.

The simple Python module read_visualdl_data.py does exactly this. The statements are quite self-explanatory.

In particular, I would like to note a few things. First, at lines 6, 14, and 22, I am creating a LogReader in a certain context or “mode”, and these modes correspond to the modes used with the LogWriter in the previous example (see the example above).

Note that the “ids” correspond to the variable epoch in the previous example.

Anyway, note that you should not run read_visualdl_data.py before write_visualdl_data.py (or, at least, before the log folder has been created and already contains the logging files).

VisualDL has a few problems

I have chosen VisualDL (as opposed to TNT and tensorboardX), but VisualDL has a few problems too. For some reason, at least in my case, the line charts are only displayed after a few minutes: more specifically, in the example above, only towards the end of the second epoch. See e.g. this Github issue. Even worse, I noticed that sometimes the line charts are not displayed at all (i.e. they are blank): in that case, I need to wait for the experiment to finish or I need to restart the VisualDL server in order to see them. I have also encountered a few weird runtime error messages on the terminal (similar to the one described in this issue), the causes of which I don’t yet know with certainty. Furthermore, I would like to note that I only tried to visualize line charts, so it is possible that you will encounter other problems while using other features of VisualDL. Finally, I would like to note that these problems can also be due to my inexperience with VisualDL (i.e. I may have done something wrong!).

Conclusion

In this blog post, I have briefly described three tools which can be used to visualize statistics of our PyTorch models, while they are being trained and tested. I particularly liked VisualDL, and so I provided two examples which show how to use VisualDL with PyTorch: one to visualize the actual statistics and the other to read them back from the file system. VisualDL is still in its infancy, but, hopefully, it will be improved and the bugs will be fixed.